This week's lab is focused on outlier detection and data cleaning. At the end of the lab, you should be able to use pandas to:
Let's start by making sure that plots are displayed inline by issuing the magic command %matplotlib inline and importing pandas in the usual way.
In [ ]:
%matplotlib inline
import pandas as pd
Next, let's load the data. Write the path to your iris.csv file (i.e. the one from Lab 02) in the cell below:
In [ ]:
path_to_csv = "data/iris.csv"
Execute the cell below to load the data into a pandas data frame and index that data frame by the species and sample_number columns:
In [ ]:
df = pd.read_csv(path_to_csv, index_col=['species', 'sample_number'])
df.head()
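Passing two column names to index_col gives the data frame a two-level index. The sketch below illustrates this on a tiny made-up CSV; the text in csv_text, the name toy and the measurement columns are hypothetical stand-ins, not the contents of iris.csv.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for iris.csv (values invented for illustration)
csv_text = """species,sample_number,sepal_length,petal_length
setosa,1,5.1,1.4
setosa,2,4.9,1.4
versicolor,1,7.0,4.7
versicolor,2,6.4,4.5
"""

# Same call pattern as above, reading from an in-memory string instead of a file
toy = pd.read_csv(io.StringIO(csv_text), index_col=['species', 'sample_number'])

# The index now has two levels: species first, then sample_number
print(toy.index.names)

# Selecting on the first level returns that species' rows, indexed by sample_number
print(toy.loc['versicolor'])
```

Because species is the first index level, toy.loc['versicolor'] selects all the rows for that species without needing a separate filter on a column.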
In [ ]:
df.plot(kind='hist');
We also saw how data frame indexing can be used to limit our view of the data to just one species of Iris. For instance, to plot a histogram for each column in our data frame, but only for the rows corresponding to Iris versicolor, we can write:
In [ ]:
versicolor = df.loc['versicolor']
versicolor.plot(kind='hist');
Plotting multiple histograms on one chart can be a little cluttered though. We also saw how we could create individual charts for each column by passing subplots=True when we call the plot method, like this:
In [ ]:
versicolor.plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6));
This is much more useful, but the histograms look a bit chunky because the default number of bins is set to ten. We can change this easily though, by passing the optional bins argument to the plot method, like in the cell below.
Note: By default, bins=10 unless otherwise specified. Increasing the number of bins results in a "higher resolution" histogram, but comes at the cost of additional visual complexity. The trade-off here is important. If we set the number of bins to be a very large number, the histogram will become much more detailed, but also more difficult to understand and interpret. On the other hand, if the number of bins is too small, then the bin widths will be very wide (i.e. the histogram will look "chunky") and important details may be hidden. Choosing the right number of bins depends on your data and how much detail you're looking for, so it can change from situation to situation. As a general rule, you should stick with the default setting initially, and only increase or decrease this if you feel that it is necessary.
In [ ]:
versicolor.plot(kind='hist', subplots=True, layout=(2,2), figsize=(12,6), bins=30);
Increasing the number of bins gives us a more detailed view of how the data is behaving, which can often make it easier to detect outliers visually. In this instance, however, it seems that all of the data is reasonably well behaved - there are no obvious extreme values.
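To make the effect of the bins argument concrete, the sketch below counts the same values with two different numbers of equal-width bins using numpy.histogram. The array values is made up for illustration, and numpy is used here only to show the bin counts without drawing a chart.

```python
import numpy as np

# Hypothetical measurements, invented for illustration
values = np.array([4.0, 4.1, 4.2, 4.2, 4.3, 4.5, 4.5, 4.6, 4.7, 5.1])

# Two wide bins: most of the detail is hidden
counts_coarse, edges_coarse = np.histogram(values, bins=2)

# Five narrower bins: the isolated value at the top end gets its own bin
counts_fine, edges_fine = np.histogram(values, bins=5)

print(counts_coarse)
print(counts_fine)
```

With only two bins, the largest value is lumped in with its neighbours; with five bins it sits alone in the last bin, which is the kind of detail that makes extreme values easier to spot.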
Boxplots offer an alternative method for visually detecting outliers. In pandas, data frames provide a dedicated boxplot method for this, separate from the plot method we have used so far. Apart from this, it operates in more or less the same way, as in the cell below.
Note: Depending on the version of pandas you are running, calling the boxplot method may generate a warning about the return_type argument not being set. This warns users that the default behaviour may change in a future release. It can safely be ignored, because the return type does not affect the plot itself for our purposes.
In [ ]:
versicolor.boxplot();
As you can see, pandas creates a boxplot for each column in our data frame and places all four boxplots in the same chart, so that we can compare the distributions of the data in the columns side by side.
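The box in each boxplot is drawn from the quartiles of the corresponding column, and these can also be inspected numerically. As a minimal sketch, assuming a small made-up data frame (the name toy and its values are hypothetical, not taken from iris.csv), the describe method reports those quartiles:

```python
import pandas as pd

# Hypothetical toy column, invented for illustration
toy = pd.DataFrame({'petal_length': [4.0, 4.2, 4.4, 4.6, 1.0]})

# describe() reports count, mean, std, min, the three quartiles and max
summary = toy.describe()

# The 25%, 50% and 75% rows are the bottom edge, middle line and top edge of the box
print(summary.loc[['25%', '50%', '75%']])
```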
Inspecting the boxplots, it becomes clear that (at least according to the logic of the boxplot test) there is an outlying observation in our petal length data. In this instance, the outlier is not far from the lower whisker of the boxplot (i.e. it's not a very extreme value), so we may not want to go to the effort of dealing with it, because it may not affect the outcome of any further analysis very severely. However, let's suppose that it is an undesirable observation and that we want to deal with it in some fashion.
As we discussed in this week's lecture, we have three options for dealing with outliers: removing them, replacing them with a more reasonable value, or adjusting how we model the data.
In this instance, we don't have a particular modelling technique in mind, so adjusting how we model the data isn't really an option. However, we can choose to either remove the data or replace it with some value that would be considered reasonable.
In order to remove an observation, we must first identify its indices in the data frame. We can do this by manually computing the whisker values and using them to identify the locations of the outliers:
Note: Typically, the lower whisker in a boxplot is set to be $1.5 \times \text{IQR}$ below the bottom edge of the box, while the upper whisker is set to be $1.5 \times \text{IQR}$ above the top edge of the box, where $\text{IQR}$ is the interquartile range, i.e. the distance between the top and bottom edges of the box.
In [ ]:
# Here, q1 = first quartile, q3 = third quartile, iqr = interquartile range, lw = lower whisker, uw = upper whisker
q1 = versicolor.quantile(0.25)
q3 = versicolor.quantile(0.75)
iqr = q3 - q1
lw = q1 - 1.5 * iqr
uw = q3 + 1.5 * iqr
# Outliers are below the lower whisker OR above the upper whisker
outliers = (versicolor < lw) | (versicolor > uw)
# Print the last few rows of "outliers"
outliers.tail()
As you can see, the outlier occurs in the 49th row of the data frame.
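Scanning the tail of the boolean frame works when we already know roughly where to look. As a hedged alternative, the sketch below reduces the same kind of mask to the index labels of the affected rows. The data frame toy and its values are made up for illustration; any and sum are standard pandas methods.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration; sample 5 has a low petal_length
toy = pd.DataFrame(
    {
        'petal_length': [4.0, 4.2, 4.4, 4.6, 1.0],
        'petal_width': [1.2, 1.3, 1.4, 1.5, 1.3],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='sample_number'),
)

# Same whisker logic as in the cell above, applied column by column
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outliers = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# True for each row that contains an outlier in at least one column
row_has_outlier = outliers.any(axis=1)

# Index labels of the flagged rows
print(list(toy.index[row_has_outlier]))

# Number of outliers found in each column
print(outliers.sum())
```

Summing the mask column by column is a quick way to check which measurements are affected before deciding how to handle them.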
To remove the row containing the outlying value, we first compute a copy of the data frame in which the outlying value is masked out. To do this, we select all the entries that are not (~) flagged in the outliers variable we computed earlier. pandas keeps those entries and replaces the flagged ones with NaN, like this:
In [ ]:
versicolor[~outliers].tail()
Next, we call the dropna method on the data frame to remove all the rows that contain a missing (NaN) value, i.e. the rows containing outlying values:
In [ ]:
removed = versicolor[~outliers].dropna()
removed.tail() # Just show the last five rows
As you can see, the 49th row (where the outlier was) has now been removed.
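The two steps can be traced on a small example. The sketch below assumes a made-up data frame and a hand-built mask (toy, its values and the flagged entry are all hypothetical) and shows that indexing with the negated mask produces NaN, which dropna then uses to discard the row.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame(
    {'petal_length': [4.0, 4.2, 1.0], 'petal_width': [1.2, 1.3, 1.3]},
    index=pd.Index([1, 2, 3], name='sample_number'),
)

# Hand-built mask flagging a single entry (sample 3, petal_length) as an outlier
outliers = pd.DataFrame(False, index=toy.index, columns=toy.columns)
outliers.loc[3, 'petal_length'] = True

# Step 1: entries flagged in the mask become NaN, everything else is kept
masked = toy[~outliers]
print(masked)

# Step 2: any row containing a NaN is dropped
removed = masked.dropna()
print(removed)
```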
If we have multiple rows and columns of data, then removing one point means we must remove the entire row or the entire column it belongs to. This is often inconvenient because we end up removing several more data points than just the one we intended to, and so our sample becomes smaller.
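The cost of each choice can be counted directly. The following sketch assumes a made-up data frame with one entry already masked as NaN (the name masked and all values are hypothetical) and compares how many valid measurements are lost when dropping the row versus dropping the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one entry already marked as missing (NaN)
masked = pd.DataFrame(
    {
        'petal_length': [4.0, 4.2, 4.4, np.nan],
        'petal_width': [1.2, 1.3, 1.4, 1.3],
        'sepal_width': [2.8, 2.9, 3.0, 2.7],
    }
)

rows_dropped = masked.dropna(axis='index')    # drop every row containing a NaN
cols_dropped = masked.dropna(axis='columns')  # drop every column containing a NaN

# Number of valid (non-NaN) values before removing anything
valid_before = int(masked.notna().sum().sum())

print(valid_before - rows_dropped.size)  # valid values lost by dropping the row
print(valid_before - cols_dropped.size)  # valid values lost by dropping the column
```

In this toy case, dropping the row discards two valid values and dropping the column discards three, even though only one entry was actually a problem.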
One alternative to removing a data point is to replace it with a suitable substitute value. Determining an appropriate substitution can be subjective, but two commonly used choices are the mean and the median. Let's replace the outlying point in the versicolor data frame with the median value of the sample (i.e. the column) it belongs to. To do this, we must first compute the median value of the sample, which we can do using the median method of the data frame, just like in Lab 02:
In [ ]:
versicolor.median()
To set the new value, we first compute a copy of the data frame without the outlying value, just like earlier. Then, we can call the fillna method to fill any missing column values with the median values of those columns, like this:
In [ ]:
replaced = versicolor[~outliers].fillna(versicolor.median())
replaced.tail() # Just show the last five rows
As you can see, the petal length value in row 49 has been replaced by the median value of the column.
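When fillna is given a Series, pandas matches the Series labels to the column names, so each column is filled with its own value. The sketch below illustrates this on made-up data (the name masked and its values are hypothetical). Note one deliberate difference from the cell above: here the medians are computed after masking, so the outlier itself does not contribute to the replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the outlying entry already masked as NaN
masked = pd.DataFrame(
    {
        'petal_length': [4.0, 4.2, 4.4, np.nan],
        'petal_width': [1.2, 1.3, 1.4, 1.3],
    }
)

# median() skips NaN by default, giving one median per column
medians = masked.median()
print(medians)

# Each NaN is filled with the median of its own column
replaced = masked.fillna(medians)
print(replaced)
```

Whether to compute the median with or without the outlier is a judgment call. The median is fairly robust to a single extreme value, so in many cases the two choices give very similar results.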